As people who work in “data science” note, about 90% of data science is getting data into a form that can be analyzed. In this chapter we will work with some tools that help with data wrangling.
Chapter 3 is an introduction to the use of Wickham’s dplyr, which is a major part of the tidyverse. All of the things that dplyr helps us do can be carried out in base R — after all, dplyr is written in R. However, the syntax of dplyr commands makes the wrangling process far less painful.
To recreate the examples presented in the text we need to load the tidyverse package, which contains dplyr, and the nycflights13 package, which contains the flight data.
### p_load comes from the pacman package, so make sure pacman is loaded first
if (!require(pacman)) { install.packages("pacman"); library(pacman) }
#detach(package:nycflights13)
p_load(nycflights13)
#detach(package:tidyverse)
p_load(tidyverse)
Beware of conflicts. As noted above, the stats::filter and stats::lag functions are masked by the dplyr versions. This is not a problem as long as we understand that the masked functions are no longer the defaults.
We now have access to the flights data that is part of the nycflights13 package. The flights data frame contains 19 variables on 336,776 flights that departed New York City in 2013. To see what is in the flights data frame we enter its name.
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Note that this is a “tibble,” which is a tweaked data frame. That is a good thing here: had this been a plain data frame, printing it would have tried to dump all of the variables and observations to the console. A safer method would have been to use base R’s head.
head(flights)
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
The listing of variables includes the variable types — something you dealt with in CS 110. The text mentions chr, dbl, int, and dttm. We will also see lgl, fctr, and date. A dttm column is stored as an S3 POSIXct object, a flexible date-time format — wait for lubridate.
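To see how these types show up, we can ask R for the class or storage type of a few columns of flights. A quick sketch (expected values are noted in the comments):
### Check the class/type of a few flights columns
class(flights$time_hour)   # "POSIXct" "POSIXt" --- the dttm column
typeof(flights$dep_delay)  # "double" --- a dbl column
typeof(flights$carrier)    # "character" --- a chr column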
Since R is object oriented, functions respond appropriately to the type of object they are given. We can also change (coerce) a variable’s storage type.
### Create local copy
dep_delay <- flights$dep_delay
### Determine type
storage.mode(dep_delay)
## [1] "double"
### Compute summary statistics
summary(dep_delay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -43.00 -5.00 -2.00 12.64 11.00 1301.00 8255
### Convert to integer
storage.mode(dep_delay) <- "integer"
### Check conversion
storage.mode(dep_delay)
## [1] "integer"
### Compute summary statistics
summary(dep_delay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -43.00 -5.00 -2.00 12.64 11.00 1301.00 8255
### Remove local copy
rm(dep_delay)
dplyr is similar to SQL. The basic functions presented in the text will take care of many of the problems that we encounter. Used with group_by and a few “joins” we should be set.
The dplyr function filter subsets a data frame (or tibble) by selecting only those observations that satisfy a set of constraints. The arguments to filter are a data frame followed by logical (boolean) comparison expressions.
We can subset the flights tibble by date. Pick a birthday, say my dog Oliver’s — July 5. We can look to see which flights took place on 7/5.
filter(flights, month == 7, day == 5)
## # A tibble: 822 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 5 38 2359 39 416 350
## 2 2013 7 5 106 2245 141 336 135
## 3 2013 7 5 500 500 0 633 640
## 4 2013 7 5 531 536 -5 804 806
## 5 2013 7 5 538 540 -2 836 840
## 6 2013 7 5 538 545 -7 812 816
## 7 2013 7 5 542 545 -3 954 921
## 8 2013 7 5 553 600 -7 835 851
## 9 2013 7 5 553 600 -7 635 655
## 10 2013 7 5 554 600 -6 821 834
## # ... with 812 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Note that filter responds by returning a tibble with 822 rows and 19 columns. While this is nice, we usually want to store a copy, so we should assign the result.
### Just assign the subset to jul5
flights_jul5 <- filter(flights, month == 7, day == 5)
dim(flights_jul5)
## [1] 822 19
### Assign and list the subset using surrounding parentheses
(flights_jul5 <- filter(flights, month == 7, day == 5))
## # A tibble: 822 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 5 38 2359 39 416 350
## 2 2013 7 5 106 2245 141 336 135
## 3 2013 7 5 500 500 0 633 640
## 4 2013 7 5 531 536 -5 804 806
## 5 2013 7 5 538 540 -2 836 840
## 6 2013 7 5 538 545 -7 812 816
## 7 2013 7 5 542 545 -3 954 921
## 8 2013 7 5 553 600 -7 835 851
## 9 2013 7 5 553 600 -7 635 655
## 10 2013 7 5 554 600 -6 821 834
## # ... with 812 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
rm(flights_jul5)
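For comparison, a minimal sketch of the same subset in base R (the which() wrapper avoids the NA-row quirk of raw logical indexing; flights_jul5_base is just a throwaway name for the example):
### Base R equivalent of filter(flights, month == 7, day == 5) --- a sketch
flights_jul5_base <- flights[which(flights$month == 7 & flights$day == 5), ]
dim(flights_jul5_base)
rm(flights_jul5_base)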
The text discusses comparison operators. As in CS 110, the typical R comparisons are <, <=, >, >=, == (equal), and != (not equal).
The logical operators that R uses are also similar to those you saw in CS 110: & (and), | (or), and ! (not).
Combining “not”, “and”, and “or” allows us to subset using just a few statements.
The book discusses De Morgan’s laws. In the lingo of R we can express these as !(A & B) = (!A) | (!B) = !A | !B and, alternatively, !(A | B) = !A & !B.
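As a quick sanity check of De Morgan’s law with filter, the two calls sketched below should return the same rows — the first drops flights with an arrival or departure delay over two hours, the second keeps flights where both delays are at most two hours (the 120-minute cutoff is arbitrary):
### De Morgan's law with filter --- both calls should return the same rows
nrow(filter(flights, !(arr_delay > 120 | dep_delay > 120)))
nrow(filter(flights, arr_delay <= 120 & dep_delay <= 120))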
Like it or not, observations often contain missing values. In R these are indicated by NA. NAs can be created during import or through computation. Note that NAs are not the same as Inf or a NaN.
x <- NA
x
## [1] NA
3*x + 2
## [1] NA
y <- NA
x < y
## [1] NA
x == y
## [1] NA
1/0
## [1] Inf
log(-1)
## Warning in log(-1): NaNs produced
## [1] NaN
filter drops observations where the comparison evaluates to FALSE or to NA. You can keep the NAs by testing for them explicitly with is.na().
x <- c(1, NA, 4, 3)
y <- c("b", "a", NA, "d")
(dat <- tibble(x,y))
## # A tibble: 4 x 2
## x y
## <dbl> <chr>
## 1 1 b
## 2 NA a
## 3 4 <NA>
## 4 3 d
rm(x, y)
filter(dat, x > 1)
## # A tibble: 2 x 2
## x y
## <dbl> <chr>
## 1 4 <NA>
## 2 3 d
filter(dat, is.na(x) | is.na(y))
## # A tibble: 2 x 2
## x y
## <dbl> <chr>
## 1 NA a
## 2 4 <NA>
filter(dat, !is.na(x) & !is.na(y))
## # A tibble: 2 x 2
## x y
## <dbl> <chr>
## 1 1 b
## 2 3 d
Base R has sort and order, which can be used to rearrange the rows of a data frame or tibble. The tidyverse contains the arrange function, which makes it easier to reorder rows by one or more columns. The desc() helper sorts a column in descending order.
arrange(dat, x)
## # A tibble: 4 x 2
## x y
## <dbl> <chr>
## 1 1 b
## 2 3 d
## 3 4 <NA>
## 4 NA a
arrange(dat, desc(y))
## # A tibble: 4 x 2
## x y
## <dbl> <chr>
## 1 3 d
## 2 1 b
## 3 NA a
## 4 4 <NA>
rm(dat)
Note that NA is ordered as last.
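For comparison, here is a minimal sketch of the base R order() approach on the same little tibble (recreated here because dat was just removed). By default, order() also places NAs last.
### Base R reordering with order() --- a sketch
x <- c(1, NA, 4, 3)
y <- c("b", "a", NA, "d")
dat <- tibble(x, y)
dat[order(dat$x), ]                    # ascending by x, NA last
dat[order(dat$y, decreasing = TRUE), ] # roughly the base analogue of desc(y)
rm(dat, x, y)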
The select function allows us to subset a dataset by selecting those variables in which we are interested. In larger datasets, this can help us use storage more efficiently.
Note: filter lets us reduce row dimension and select lets us reduce column dimension.
### Make some data
x <- rep(1:20, each=3)
x <- matrix(x, ncol=20)
colnames(x) <- letters[1:20]
colnames(x)
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t"
x <- data.frame(x)
### Now play with subsetting of columns
###
### We can select by naming columns
select(x, a, d, c, g)
## a d c g
## 1 1 4 3 7
## 2 1 4 3 7
## 3 1 4 3 7
### Or by listing column values
select(x, 1, 4, 3, 7)
## a d c g
## 1 1 4 3 7
## 2 1 4 3 7
## 3 1 4 3 7
### Alternative listing
indices <- c(1, 4, 3, 7)
select(x, indices)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(indices)` instead of `indices` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## a d c g
## 1 1 4 3 7
## 2 1 4 3 7
## 3 1 4 3 7
rm(indices)
A really nice feature of R is the ability to exclude rows and columns by using negative indices. select lets us use this feature for columns.
indices <- 4:17
select(x, -indices)
## a b c r s t
## 1 1 2 3 18 19 20
## 2 1 2 3 18 19 20
## 3 1 2 3 18 19 20
### Or..
select(x, -4:17)
## Warning in x:y: numerical expression has 19 elements: only the first used
## a b c d e f g h i j k l m n o p q
## 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
### Oops, note that
-4:17
## [1] -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
### so try
select(x, -(4:17))
## a b c r s t
## 1 1 2 3 18 19 20
## 2 1 2 3 18 19 20
## 3 1 2 3 18 19 20
rm(indices)
There are some interesting helper functions that work with select. These functions will be useful when we have data that are repeated measurements.
### Rename the columns to show selection methods
colnames(x) <- paste0(rep(c("X","Y","XX","YY"), each=5), rep(c(1:5), 4))
colnames(x)
## [1] "X1" "X2" "X3" "X4" "X5" "Y1" "Y2" "Y3" "Y4" "Y5" "XX1" "XX2"
## [13] "XX3" "XX4" "XX5" "YY1" "YY2" "YY3" "YY4" "YY5"
### Now select columns by pattern
select(x, starts_with("X"))
## X1 X2 X3 X4 X5 XX1 XX2 XX3 XX4 XX5
## 1 1 2 3 4 5 11 12 13 14 15
## 2 1 2 3 4 5 11 12 13 14 15
## 3 1 2 3 4 5 11 12 13 14 15
select(x, starts_with("YY"))
## YY1 YY2 YY3 YY4 YY5
## 1 16 17 18 19 20
## 2 16 17 18 19 20
## 3 16 17 18 19 20
select(x, ends_with("1"))
## X1 Y1 XX1 YY1
## 1 1 6 11 16
## 2 1 6 11 16
## 3 1 6 11 16
select(x, num_range("Y", c(1,3,5)))
## Y1 Y3 Y5
## 1 6 8 10
## 2 6 8 10
## 3 6 8 10
The use of matches will be discussed later.
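As a quick preview, matches() selects columns whose names match a regular expression. A small sketch with the current column names — the anchored pattern picks up X1 through X5 but not the XX columns that starts_with("X") also grabbed:
### Preview of matches() --- select names matching a regular expression
select(x, matches("^X[0-9]$"))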
rename is a “variant” of select that does not affect the other column/variable names. While we can rename variables as we did above, using rename removes the need to keep track of column indices.
### Change the names of columns 5, 10, 15, and 20 the old school way
cnames <- colnames(x)
names(x)[c(5,10,15,20)] <- c("W1","W2","W3","W4")
names(x)
## [1] "X1" "X2" "X3" "X4" "W1" "Y1" "Y2" "Y3" "Y4" "W2" "XX1" "XX2"
## [13] "XX3" "XX4" "W3" "YY1" "YY2" "YY3" "YY4" "W4"
### Reset the column names
(colnames(x) <- cnames)
## [1] "X1" "X2" "X3" "X4" "X5" "Y1" "Y2" "Y3" "Y4" "Y5" "XX1" "XX2"
## [13] "XX3" "XX4" "XX5" "YY1" "YY2" "YY3" "YY4" "YY5"
### Change the names of the 3rd column in each group the rename way
rename(x, A1 = X3, A2 = Y3, A3 = XX3, A4 = YY3)
## X1 X2 A1 X4 X5 Y1 Y2 A2 Y4 Y5 XX1 XX2 A3 XX4 XX5 YY1 YY2 A4 YY4 YY5
## 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
names(x)
## [1] "X1" "X2" "X3" "X4" "X5" "Y1" "Y2" "Y3" "Y4" "Y5" "XX1" "XX2"
## [13] "XX3" "XX4" "XX5" "YY1" "YY2" "YY3" "YY4" "YY5"
rm(cnames, x)
The mutate function allows us to create new variables. This, too, can be done using base R. However, mutate lets us drop the repeated dat$ prefixes, which shortens the code and adds to readability.
We first look at base R.
### Make some data
x <- c(1, NA, 4, 3)
y <- c("b", "a", NA, "d")
(dat <- tibble(x,y))
## # A tibble: 4 x 2
## x y
## <dbl> <chr>
## 1 1 b
## 2 NA a
## 3 4 <NA>
## 4 3 d
rm(x, y)
### Create several new variables from x and y the base R way
dat$logx <- log(dat$x)
dat$x2 <- dat$x^2
dat$y1 <- lag(dat$y)
dat$divx <- 47 %/% dat$x
dat$modx <- 47 %% dat$x
dat$xeven <- (dat$x %% 2) == 0
dat$xcumsum <- cumsum(dat$x)
dat
## # A tibble: 4 x 9
## x y logx x2 y1 divx modx xeven xcumsum
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <lgl> <dbl>
## 1 1 b 0 1 <NA> 47 0 FALSE 1
## 2 NA a NA NA b NA NA NA NA
## 3 4 <NA> 1.39 16 a 11 3 TRUE NA
## 4 3 d 1.10 9 <NA> 15 2 FALSE NA
mutate shortens the above.
### Retrieve the original dat
(dat <- select(dat, x, y))
## # A tibble: 4 x 2
## x y
## <dbl> <chr>
## 1 1 b
## 2 NA a
## 3 4 <NA>
## 4 3 d
mutate(dat,
lnx = log(x),
x2 = x^2,
y1 = lag(y),
divx = 47 %/% x,
modx = 47 %% x,
xeven = (x %% 2) == 0,
xcumsum = cumsum(x)
)
## # A tibble: 4 x 9
## x y lnx x2 y1 divx modx xeven xcumsum
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <lgl> <dbl>
## 1 1 b 0 1 <NA> 47 0 FALSE 1
## 2 NA a NA NA b NA NA NA NA
## 3 4 <NA> 1.39 16 a 11 3 TRUE NA
## 4 3 d 1.10 9 <NA> 15 2 FALSE NA
rm(dat)
As indicated in the text, summarize is really only helpful when used with group_by — which is explained in a later section. For now, we note that summarize allows us to compute summary information from a tibble or data frame.
The text says that it will cover na.rm=TRUE soon. The short version is that rm stands for remove: any observation that is NA is removed from the computation. Note that this reduces the effective sample size.
### Create some data
set.seed(47)
z <- rnorm(1000, 0, 1)
xbar <- rep(-4:5, each=100)
s <- rep(1:5, 200)
y <- z * s
x <- y + xbar
dat <- tibble(x, y, z, xbar, s)
### Find the means of all three variables
summarize(dat, xmean = mean(x, na.rm=TRUE), ymean = mean(y, na.rm=TRUE), zmean = mean(z, na.rm=TRUE))
## # A tibble: 1 x 3
## xmean ymean zmean
## <dbl> <dbl> <dbl>
## 1 0.558 0.0584 0.0279
### Find mean, std dev, and n for x then z
summarize(dat, xmean = mean(x, na.rm=TRUE), xsd = sd(x, na.rm=TRUE), n=n())
## # A tibble: 1 x 3
## xmean xsd n
## <dbl> <dbl> <int>
## 1 0.558 4.39 1000
summarize(dat, zmean = mean(z, na.rm=TRUE), zsd = sd(z, na.rm=TRUE), n=n())
## # A tibble: 1 x 3
## zmean zsd n
## <dbl> <dbl> <int>
## 1 0.0279 1.05 1000
rm(z, s, y, x, xbar)
As noted above, we can also compute summaries by group.
### Group the dat data by the true mean (xbar) and std dev (s)
dat_by_xbar_s <- group_by(dat, xbar, s)
### Compute grouped means for x, y, and z
summarize(dat_by_xbar_s, xmean = mean(x, na.rm=TRUE), ymean = mean(y, na.rm=TRUE), zmean = mean(z, na.rm=TRUE))
## `summarise()` has grouped output by 'xbar'. You can override using the `.groups` argument.
## # A tibble: 50 x 5
## # Groups: xbar [10]
## xbar s xmean ymean zmean
## <int> <int> <dbl> <dbl> <dbl>
## 1 -4 1 -3.94 0.0627 0.0627
## 2 -4 2 -3.83 0.166 0.0829
## 3 -4 3 -3.38 0.621 0.207
## 4 -4 4 -4.39 -0.388 -0.0970
## 5 -4 5 -3.99 0.0102 0.00204
## 6 -3 1 -3.05 -0.0504 -0.0504
## 7 -3 2 -2.69 0.315 0.157
## 8 -3 3 -2.90 0.0985 0.0328
## 9 -3 4 -0.540 2.46 0.615
## 10 -3 5 -1.79 1.21 0.241
## # ... with 40 more rows
### Compute grouped sds for x, y, and z
summarize(dat_by_xbar_s, xs = sd(x, na.rm=TRUE), ys = sd(y, na.rm=TRUE), zs = sd(z, na.rm=TRUE))
## `summarise()` has grouped output by 'xbar'. You can override using the `.groups` argument.
## # A tibble: 50 x 5
## # Groups: xbar [10]
## xbar s xs ys zs
## <int> <int> <dbl> <dbl> <dbl>
## 1 -4 1 1.13 1.13 1.13
## 2 -4 2 1.62 1.62 0.809
## 3 -4 3 3.22 3.22 1.07
## 4 -4 4 4.61 4.61 1.15
## 5 -4 5 3.77 3.77 0.754
## 6 -3 1 0.775 0.775 0.775
## 7 -3 2 2.45 2.45 1.22
## 8 -3 3 3.44 3.44 1.15
## 9 -3 4 3.26 3.26 0.815
## 10 -3 5 4.46 4.46 0.892
## # ... with 40 more rows
### Now look at things for "fun"
dat_by_xbar <- group_by(dat, xbar)
summarize(dat_by_xbar, xmean = mean(x, na.rm=TRUE), ymean = mean(y, na.rm=TRUE), zmean = mean(z, na.rm=TRUE))
## # A tibble: 10 x 4
## xbar xmean ymean zmean
## * <int> <dbl> <dbl> <dbl>
## 1 -4 -3.91 0.0942 0.0515
## 2 -3 -2.19 0.806 0.199
## 3 -2 -1.66 0.336 0.135
## 4 -1 -1.04 -0.0416 -0.0200
## 5 0 0.222 0.222 0.0283
## 6 1 0.665 -0.335 -0.0885
## 7 2 1.91 -0.0940 -0.0335
## 8 3 3.14 0.141 0.0649
## 9 4 4.01 0.00955 0.0111
## 10 5 4.45 -0.554 -0.0685
summarize(dat_by_xbar, xs = sd(x, na.rm=TRUE), ys = sd(y, na.rm=TRUE), zs = sd(z, na.rm=TRUE))
## # A tibble: 10 x 4
## xbar xs ys zs
## * <int> <dbl> <dbl> <dbl>
## 1 -4 3.11 3.11 0.983
## 2 -3 3.21 3.21 0.995
## 3 -2 3.27 3.27 1.04
## 4 -1 3.62 3.62 1.12
## 5 0 3.68 3.68 0.982
## 6 1 3.76 3.76 1.06
## 7 2 3.08 3.08 1.03
## 8 3 3.87 3.87 1.13
## 9 4 3.42 3.42 1.01
## 10 5 3.89 3.89 1.13
dat_by_s <- group_by(dat, s)
summarize(dat_by_s, xmean = mean(x, na.rm=TRUE), ymean = mean(y, na.rm=TRUE), zmean = mean(z, na.rm=TRUE))
## # A tibble: 5 x 4
## s xmean ymean zmean
## * <int> <dbl> <dbl> <dbl>
## 1 1 0.436 -0.0639 -0.0639
## 2 2 0.830 0.330 0.165
## 3 3 0.555 0.0548 0.0183
## 4 4 1.02 0.522 0.131
## 5 5 -0.0510 -0.551 -0.110
summarize(dat_by_s, xs = sd(x, na.rm=TRUE), ys = sd(y, na.rm=TRUE), zs = sd(z, na.rm=TRUE))
## # A tibble: 5 x 4
## s xs ys zs
## * <int> <dbl> <dbl> <dbl>
## 1 1 2.97 0.975 0.975
## 2 2 3.75 2.19 1.09
## 3 3 4.20 3.11 1.04
## 4 4 5.03 4.25 1.06
## 5 5 5.49 5.25 1.05
summarize(dat_by_xbar_s, n=n())
## `summarise()` has grouped output by 'xbar'. You can override using the `.groups` argument.
## # A tibble: 50 x 3
## # Groups: xbar [10]
## xbar s n
## <int> <int> <int>
## 1 -4 1 20
## 2 -4 2 20
## 3 -4 3 20
## 4 -4 4 20
## 5 -4 5 20
## 6 -3 1 20
## 7 -3 2 20
## 8 -3 3 20
## 9 -3 4 20
## 10 -3 5 20
## # ... with 40 more rows
rm(dat, dat_by_xbar_s, dat_by_xbar, dat_by_s)
Because the flow of the text’s example is somewhat involved, I am going to follow the book’s example directly. So, we will look at flight delays.
Without piping, we could look at the relationship between delay and dest using the following code.
by_dest <- group_by(flights, dest)
delay <- summarize(by_dest,
count = n(),
dist = mean(distance, na.rm=TRUE),
delay = mean(arr_delay, na.rm=TRUE)
)
(delay <- filter(delay, count>20, dest != "HNL"))
## # A tibble: 96 x 4
## dest count dist delay
## <chr> <int> <dbl> <dbl>
## 1 ABQ 254 1826 4.38
## 2 ACK 265 199 4.85
## 3 ALB 439 143 14.4
## 4 ATL 17215 757. 11.3
## 5 AUS 2439 1514. 6.02
## 6 AVL 275 584. 8.00
## 7 BDL 443 116 7.05
## 8 BGR 375 378 8.03
## 9 BHM 297 866. 16.9
## 10 BNA 6333 758. 11.8
## # ... with 86 more rows
Noting that delays seem to decrease as flight distances increase, we might decide to plot the data — a graph is the easier way to look for such trends.
ggplot(data=delay, mapping=aes(x=dist, y=delay)) +
geom_point(aes(size=count), alpha=1/3) + # alpha controls transparency
geom_smooth(se=TRUE) +
theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Okay, we missed that the shorter flights do pretty well too.
The book notes that there are three basic steps to the above process:
1. Group observations by destination.
2. Compute the average distance, average delay, and number of flights.
3. Drop Honolulu (HNL), which does not have the same traffic pattern as a “normal” airport, and destinations with infrequent flights (20 or fewer).
While the approach of using intermediate tibbles/data frames is effective, the code needed to generate the final result is not as easy to read as it could be. This is where piping comes in.
Using dplyr we can pipe (%>%) intermediate results to the next operation. The three steps from above now look like:
### Check for the intermediate data frames from previous computations
ls()
## [1] "by_dest" "delay"
rm(delay, by_dest)
ls()
## character(0)
delay <- flights %>% # Start with the flights data frame
group_by(dest) %>% # 1. Group by dest
summarize( # 2. Compute avg distance and delay, num flights
count = n(),
dist = mean(distance, na.rm=TRUE),
delay = mean(arr_delay, na.rm=TRUE)
) %>%
filter(count > 20, dest != "HNL") # 3. Drop "HNL" and infrequent flights
ls()
## [1] "delay"
Note that delay and by_dest were still sitting around from our previous computations. We remove them and then recompute using pipes, which leaves no intermediate data frame (by_dest) behind.
The new code has a flow to it that is easy to read. Using the text’s suggestion of reading “%>%” as “then” makes the code read like a (run-on) sentence: “Let delay equal flights, then group by dest, then summarize…, then filter…”
Plotting could be done using ggplot2 on delay in the same way as before, or we can try the ggvis package. Note the use of the pipe (%>%) with ggvis instead of the plus sign (+) used with ggplot2. And, since ggvis uses pipes, there are no residual tibbles sitting around.
### Install or load ggvis
### Note: when knitting non-interactively, install.packages needs a CRAN
### mirror (e.g., repos = "https://cloud.r-project.org"), hence the error below
if (!require(ggvis)){ # Install/load ggvis
install.packages("ggvis", dependencies=TRUE)
require(ggvis)
}
## Loading required package: ggvis
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'ggvis'
## Installing package into 'C:/Users/School/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
ls()
## [1] "delay"
rm(delay)
ls()
## character(0)
flights %>% # Start with the flights data frame
group_by(dest) %>% # 1. Group by dest
summarize( # 2. Compute avg distance and delay, num flights
count = n(),
dist = mean(distance, na.rm=TRUE),
delay = mean(arr_delay, na.rm=TRUE)
) %>%
filter(count > 20, dest != "HNL") %>% # 3. Drop "HNL" and infrequent flights
ggvis(~dist, ~delay) %>% # Use ggvis to plot the piped data
layer_points(size = ~count,
opacity := 1/3 # Set opacity statically or use
#, opacity := input_slider(0, 1, value = 1/3, label = "Opacity")
) %>%
layer_smooths(se = TRUE, stroke := "skyblue") %>%
add_legend("size",
title = "Number of Arrivals",
orient = "right"
)
## Error in add_legend(., "size", title = "Number of Arrivals", orient = "right"): could not find function "add_legend"
ls()
## character(0)
For more on ggvis see https://ggvis.rstudio.com/, http://jimhester.github.io/ggplot2ToGgvis/ and https://rpubs.com/jacobdickey_2016/223060. Remember that ggvis is currently in beta. At this point you are probably better off using ggplot2.
When we gather data from various sources and combine these data, we often generate a lot of missing values. While it is possible to impute these values, we usually ignore them — by tossing the entire observation.
We now take a look at the na.rm option. Reusing the data from our earlier discussion of NA we can look at how aggregation functions treat missing values.
### Build a test tibble
x <- c(1, NA, 4, 3)
y <- c("b", "a", NA, "d")
(dat <- tibble(x,y))
## # A tibble: 4 x 2
## x y
## <dbl> <chr>
## 1 1 b
## 2 NA a
## 3 4 <NA>
## 4 3 d
rm(x, y)
### Compute the mean of the numeric variable, x
mean(dat$x, na.rm=FALSE)
## [1] NA
mean(dat$x, na.rm=TRUE)
## [1] 2.666667
rm(dat)
Note that computing with NAs does not generate an error. Having seen that arithmetic on NAs produces NAs, we should not be surprised that mean returns NA when the variable contains NAs. It is nice that na.rm=TRUE makes it possible for mean to return a value. However, as the book points out, we need to be careful about the number of observations that remain.
Now that we understand how na.rm affects aggregation functions, we can move on to its use in summarize.
flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 NA
## 2 2013 1 2 NA
## 3 2013 1 3 NA
## 4 2013 1 4 NA
## 5 2013 1 5 NA
## 6 2013 1 6 NA
## 7 2013 1 7 NA
## 8 2013 1 8 NA
## 9 2013 1 9 NA
## 10 2013 1 10 NA
## # ... with 355 more rows
Since the default is na.rm=FALSE, we end up with a lot of NAs. Using na.rm=TRUE gives what appear to be more meaningful results.
flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay, na.rm=TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ... with 355 more rows
So, which observations were used in the computation of the grouped means? Use of the is.na function can help us figure this out.
### Subset the flights to those that were not cancelled and save for later
not_cancelled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
### The means are close to those computed above, though not identical because we also dropped flights with a missing arr_delay
not_cancelled %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay)) # Note that the variable "mean" is being created
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.4
## 2 2013 1 2 13.7
## 3 2013 1 3 10.9
## 4 2013 1 4 8.97
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.56
## 9 2013 1 9 2.30
## 10 2013 1 10 2.84
## # ... with 355 more rows
### Figure out how many observations there are in each data frame
dim(flights)
## [1] 336776 19
dim(not_cancelled)
## [1] 327346 19
Having removed all of the observations with missing delays, the default na.rm=FALSE is no longer a problem. Note that we dropped 9430 observations because of missing values. We might wonder how these missing values are distributed.
The use of na.rm=TRUE or filtering with !is.na removes all of the observations with NAs. If there is something systematic (non-random) in the way the NAs were generated, then removing them can greatly change our interpretation of the data. A first step in figuring out what is going on is to look at the number of missing and nonmissing values in the data set.
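One simple first step is to count the missing and nonmissing values by group. A minimal sketch, treating a missing dep_delay as a cancelled flight:
### Count cancelled (missing dep_delay) and flown flights by day --- a sketch
flights %>%
  group_by(year, month, day) %>%
  summarize(cancelled = sum(is.na(dep_delay)),
            flown     = sum(!is.na(dep_delay)))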
We continue looking at the example of not cancelled flights that we started above.
### Subset the flights to those that were not cancelled and save for later
not_cancelled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(delay = mean(dep_delay)) # Note that the variable "delay" is being created
ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10) +
theme_bw()
While there are not many, there are a few planes (tail numbers) that have an average delay of over two hours. The reason for this is more logical than we might originally think. How would a single flight that is delayed for three hours show up on the plot? Single-flight failures can screw up our interpretation. As noted in the text, a plot of the number of flights against mean delay is more informative.
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(
delay = mean(dep_delay), # The variable "delay" is created
n = n() # n is the number of flights
)
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10) +
theme_bw()
We note the clear heteroscedasticity (lack of constant variability) across n. In fact, as n increases, variability decreases. This is something that is covered in most introductory statistics courses, where we learn that \(\widehat{\sigma_{\bar{X}}} = s / \sqrt{n}\). So, if we resample larger samples, our sample means will differ from each other less than if we resample smaller samples.
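A quick simulation sketch of this fact: the standard deviation of a collection of sample means shrinks like \(1 / \sqrt{n}\) as the sample size n grows.
### Simulate the standard deviation of sample means for n = 5, 50, 500
set.seed(47)
sapply(c(5, 50, 500), function(n) sd(replicate(1000, mean(rnorm(n)))))
### These should be roughly 1/sqrt(5), 1/sqrt(50), and 1/sqrt(500)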
The text suggests that we toss out planes that do not fly on a regular basis — say, 25 or fewer flights in the year. Replotting the graph gives the following.
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10) +
theme_bw()
The problem of decreasing variability is still apparent, but less drastic than before. What do the negative mean delays suggest?
The text’s next example is an excuse to use the Lahman baseball data. The Lahman package gives access to seasonal statistics for MLB players and teams. For more information on analyzing MLB data see Jim Albert’s Teaching Statistics Using Baseball or Marchi and Albert’s Analyzing Baseball Data with R.
We start by looking at the Batting data frame and converting it to a tibble.
p_load(Lahman)
is.data.frame(Lahman::Batting)
## [1] TRUE
head(Lahman::Batting)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4 0
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2 5
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0 2
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2 1
## 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0 1
## IBB HBP SH SF GIDP
## 1 NA NA NA NA 0
## 2 NA NA NA NA 0
## 3 NA NA NA NA 1
## 4 NA NA NA NA 0
## 5 NA NA NA NA 0
## 6 NA NA NA NA 0
(batting <- as_tibble(Lahman::Batting))
## # A tibble: 107,429 x 22
## playerID yearID stint teamID lgID G AB R H X2B X3B HR
## <chr> <int> <int> <fct> <fct> <int> <int> <int> <int> <int> <int> <int>
## 1 abercda~ 1871 1 TRO NA 1 4 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0
## 3 allisar~ 1871 1 CL1 NA 29 137 28 40 4 5 0
## 4 allisdo~ 1871 1 WS3 NA 27 133 28 44 10 2 2
## 5 ansonca~ 1871 1 RC1 NA 25 120 29 39 11 3 0
## 6 armstbo~ 1871 1 FW1 NA 12 49 9 11 2 1 0
## 7 barkeal~ 1871 1 RC1 NA 1 4 0 1 0 0 0
## 8 barnero~ 1871 1 BS1 NA 31 157 66 63 10 9 0
## 9 barrebi~ 1871 1 FW1 NA 1 5 1 1 1 0 0
## 10 barrofr~ 1871 1 BS1 NA 18 86 13 13 2 1 0
## # ... with 107,419 more rows, and 10 more variables: RBI <int>, SB <int>,
## # CS <int>, BB <int>, SO <int>, IBB <int>, HBP <int>, SH <int>, SF <int>,
## # GIDP <int>
Since we are interested in the batters themselves, we compute individual career batting averages. Note that playerID is a unique identifier for every person who has played in MLB.
batters <- batting %>%
group_by(playerID) %>%
summarize(
ba = sum(H, na.rm=TRUE)/sum(AB, na.rm=TRUE), # H is the number of hits
ab = sum(AB, na.rm=TRUE) # AB is the number of at bats
)
### Plot those batters with more than 100 career at bats
batters %>%
filter (ab > 100) %>%
ggplot(mapping = aes(x = ab, y = ba)) +
geom_point(alpha = 1/10) +
geom_smooth(se = TRUE) + # Look at the reliability of the loess smooth
theme_bw()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
rm(batters, batting)
The text points out the “obvious” heteroscedasticity and the positive association between ba and ab. By including the standard error of the smooth, we see that the reliability of the smooth decreases where there are fewer observations — here, for players with the highest ab values.
This section lists a number of summary functions that the authors deem “useful.” Included are measures of location, spread, rank, position, count, and counts and proportions of logical values. The latter is important enough that we review the book example below.
Note that the mad function discussed under spread computes the (scaled) median absolute deviation, \(\mbox{mad}(x) = c \cdot \mbox{median}\left(\left|x_i - \mbox{median}(x)\right|\right)\), where the default scale constant is \(c = 1.4826\). The functions in this section are genuinely useful and you should read the subsections carefully.
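As a small illustration — assuming not_cancelled is still sitting in the workspace from above — here is a sketch that computes a few of these summaries for each destination; the result column names are made up for the example:
### A few of the "useful" summary functions, by destination --- a sketch
not_cancelled %>%
  group_by(dest) %>%
  summarize(
    delay_sd  = sd(dep_delay),            # spread
    delay_iqr = IQR(dep_delay),           # spread
    delay_mad = mad(dep_delay),           # robust spread
    delay_q90 = quantile(dep_delay, 0.9), # position: 90th percentile
    n_planes  = n_distinct(tailnum)       # count of distinct planes
  )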
In working with the flight data we have grouped by dest, by tailnum, and by year, month, and day. The first two of these groupings are univariate. The third is a group_by on three variables. This section is devoted to the latter, multivariate grouping method.
The book’s example of daily flights is interesting in that it allows “roll ups.” A closer look at what is going on is warranted.
daily <- not_cancelled %>%
group_by(year, month, day)
dim(daily)
## [1] 327346 19
### Look at the number of flights per day
(per_day <- summarize(daily, nflights = n()))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day nflights
## <int> <int> <int> <int>
## 1 2013 1 1 831
## 2 2013 1 2 928
## 3 2013 1 3 900
## 4 2013 1 4 908
## 5 2013 1 5 717
## 6 2013 1 6 829
## 7 2013 1 7 930
## 8 2013 1 8 892
## 9 2013 1 9 893
## 10 2013 1 10 929
## # ... with 355 more rows
sum(per_day$nflights)
## [1] 327346
### Look at the number of flights per month --- a roll up of per day by month
(per_month <- summarize(per_day, nflights = sum(nflights)))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups: year [1]
## year month nflights
## <int> <int> <int>
## 1 2013 1 26398
## 2 2013 2 23611
## 3 2013 3 27902
## 4 2013 4 27564
## 5 2013 5 28128
## 6 2013 6 27075
## 7 2013 7 28293
## 8 2013 8 28756
## 9 2013 9 27010
## 10 2013 10 28618
## 11 2013 11 26971
## 12 2013 12 27020
sum(per_month$nflights)
## [1] 327346
### Look at the number of flights per year --- a roll up of per month by year
(per_year <- summarize(per_month, nflights = sum(nflights)))
## # A tibble: 1 x 2
## year nflights
## * <int> <int>
## 1 2013 327346
sum(per_year$nflights)
## [1] 327346
Note that a roll up can only occur in the reverse order of the group_by. That is, since we grouped by year, month, day — or day within month within year — we can get counts by day then month then year. The order of nesting is seen to be very important in our ability to roll up.
The book makes reference to a problem in rolling up summaries based on other functions. For example, because the denominator used in computing a mean is the group size, a roll up of unbalanced data can produce surprising results.
daily <- not_cancelled %>%
group_by(year, month, day)
### Look at the mean flight delay by day
(per_day <- summarize(daily, xbar = mean(arr_delay)))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day xbar
## <int> <int> <int> <dbl>
## 1 2013 1 1 12.7
## 2 2013 1 2 12.7
## 3 2013 1 3 5.73
## 4 2013 1 4 -1.93
## 5 2013 1 5 -1.53
## 6 2013 1 6 4.24
## 7 2013 1 7 -4.95
## 8 2013 1 8 -3.23
## 9 2013 1 9 -0.264
## 10 2013 1 10 -5.90
## # ... with 355 more rows
mean(per_day$xbar)
## [1] 7.005464
dim(per_day)
## [1] 365 4
### One (questionable) monthly roll up --- sum the daily means and divide by 365
(per_month <- summarize(per_day, xbar = sum(xbar)/365))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups: year [1]
## year month xbar
## <int> <int> <dbl>
## 1 2013 1 0.508
## 2 2013 2 0.457
## 3 2013 3 0.524
## 4 2013 4 0.910
## 5 2013 5 0.328
## 6 2013 6 1.35
## 7 2013 7 1.43
## 8 2013 8 0.500
## 9 2013 9 -0.315
## 10 2013 10 -0.0268
## 11 2013 11 0.0133
## 12 2013 12 1.32
mean(per_month$xbar)
## [1] 0.5837887
### Look at the mean flight delay by month --- a roll up of per day by month
(per_month <- summarize(per_day, xbar = mean(xbar)))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups: year [1]
## year month xbar
## <int> <int> <dbl>
## 1 2013 1 5.98
## 2 2013 2 5.96
## 3 2013 3 6.17
## 4 2013 4 11.1
## 5 2013 5 3.86
## 6 2013 6 16.5
## 7 2013 7 16.8
## 8 2013 8 5.89
## 9 2013 9 -3.83
## 10 2013 10 -0.316
## 11 2013 11 0.162
## 12 2013 12 15.6
mean(per_month$xbar)
## [1] 6.985884
### Look at the number of flights per year --- a roll up of per month by year
(per_year <- summarize(per_month, xbar = mean(xbar)))
## # A tibble: 1 x 2
## year xbar
## * <int> <dbl>
## 1 2013 6.99
sum(per_year$xbar)
## [1] 6.985884
### Overall mean
mean(daily$arr_delay)
## [1] 6.895377
While 6.9858845 is close to 6.8953768, they are not the same. Nor were the two methods of computing the monthly means close to each other, or to what a direct univariate group_by on month would give. This is just a reminder to be careful about the use of roll ups.
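If we really do want to roll up means, one fix is to carry the group sizes along and take weighted means at each level. A minimal sketch using the daily grouping from above (which is removed just below):
### Roll up means correctly by weighting with the group sizes --- a sketch
per_day_n   <- summarize(daily, xbar = mean(arr_delay), n = n())
per_month_n <- summarize(per_day_n, xbar = weighted.mean(xbar, n), n = sum(n))
per_year_n  <- summarize(per_month_n, xbar = weighted.mean(xbar, n), n = sum(n))
per_year_n$xbar # should match mean(daily$arr_delay) up to rounding
rm(per_day_n, per_month_n, per_year_n)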
ungroup is helpful when you screw up and lose your original, ungrouped tibble. If you are careful, you should not need to ungroup.
dim(not_cancelled)
## [1] 327346 19
dim(daily %>% ungroup())
## [1] 327346 19
rm(daily, not_cancelled, delays, per_day, per_month, per_year)
The text discusses the use of grouped mutates and filters to describe subsets of the dataset. I have seen SQL programmers write code to create tables of means for subgroups within a large sample. They then use these tables to “describe” patterns in the sample — by listing the means that differ from the others. This method is not efficient, and the modeling tools that we will discuss do a far better job of describing datasets.
Remember, group_by, mutate, and filter are intended to be used with piping. The idea is to make code more intuitive to read. Mutates applied within subsets of data created by grouping and filtering can become difficult to follow/interpret. Sometimes it is better to use old school programming, or to break up the piping, so as to make the code more readable.
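In the spirit of the book’s examples, here is a sketch of a grouped filter (the three worst arrival delays on each day) and a grouped mutate (each flight’s share of its day’s total positive delay); the cutoffs and new column names are illustrative only:
### Grouped filter: the three worst arrival delays per day --- a sketch
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(year, month, day) %>%
  filter(min_rank(desc(arr_delay)) <= 3) %>%
  select(year, month, day, carrier, flight, arr_delay)
### Grouped mutate: each flight's share of its day's total positive delay --- a sketch
flights %>%
  filter(!is.na(arr_delay), arr_delay > 0) %>%
  group_by(year, month, day) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(year, month, day, carrier, flight, arr_delay, prop_delay)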
The “by hand” comparison of descriptives of subgroups is tedious and generally ineffective. Models are more effective ways of describing general trends seen in data. However, the use of descriptives of subgroups of the data as bases for graphical displays and testing can be very effective. We will discuss post hoc techniques that effectively supplement modeling.